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We  present  two  graph-based  algorithms  for  multiclass  segmentation  of  high-dimensional 
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uses  a  reduction  of  GL  minimization,  based  on  the  Merriman-Bence-Osher  scheme  for  mo¬ 
tion  by  mean  curvature.  These  yield  accurate  and  efficient  algorithms  for  semi-supervised 
learning.  Our  algorithms  outperform  existing  methods,  including  supervised  learning  ap¬ 
proaches,  on  the  benchmark  datasets  that  we  used.  We  refer  to  Garcia-Cardona  (2014)  for 
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Convex  splitting 


1.  Introduction 

Multiclass  segmentation  is  a  well  studied  problem  in  machine  learning  and  computer  vision.  An  energy  formulated  for 
the  problem  is  often  made  up  of  a  regularization  term  and  a  fidelity  term.  Recently,  graph-based  regularization  terms  have 
been  used  to  take  into  account  the  similarities  in  the  dataset.  We  present  two  graph-based  segmentation  procedures  inspired 
by  [1],  which  describes  a  binary  segmentation  method  consisting  of  minimizing  the  Ginzburg-Landau  functional  on  graphs 
using  gradient  descent.  The  first  method  extends  this  procedure  to  a  multiclass  case  using  the  idea  of  the  Gibbs  simplex. 
The  second  method  applies  the  simplex  idea  to  the  graph-based  Merriman-Bence-Osher  (MBO)  scheme  developed  in  [2]. 
Both  algorithms  result  in  efficient  and  accurate  ways  to  perform  multiclass  segmentation.  A  more  detailed  explanation  of 
the  methods,  intended  for  a  computer  science  audience,  is  contained  in  [3],  while  here  we  follow  an  applied  mathematics 
approach.  We  also  present  different  results  using  other  benchmark  datasets  than  the  ones  in  [3]. 

1.1.  Ginzburg-Landau  functional  and  diffuse  interface  model 

The  Ginzburg-Landau  (GL)  functional,  originally  proposed  to  describe  physical  phenomena,  is  formulated  as: 


(1) 


*  Corresponding  author.  Tel.:  +1  3104902193. 

E-mail  addresses:  kmerkurjev@gmail.com,  kmerkurev@math.ucla.edu  (E.  Merkurjev),  cristina.cgarcia@gmail.com  (C.  Garcia-Cardona), 
bertozzi@math.ucla.edu  (A.L.  Bertozzi),  allon.percus@cgu.edu  (A.G.  Percus). 

http://dx.doi.Org/10.1016/j.aml.2014.02.008 
0893-9659 /©  2014  Elsevier  Ltd.  All  rights  reserved. 


Form  Approved 
OMB  No.  0704-0188 


Report  Documentation  Page 


Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
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where  u  denotes  the  state  of  the  phases,  V  represents  the  spatial  gradient  operator,  @(u)  is  a  double-well  potential,  such  as 
\(u2  —  l)2,  and  €  is  a  positive  constant. 

Kohn  et  al.  show  in  [4]  that  the  e  ->  0  limit  of  (1),  in  the  sense  of  T-convergence,  is  the  total  variation  semi-norm: 

GL(u )  ->► r  II tf  || tv •  (2) 

One  might  prefer  to  use  the  GL  functional  in  data  segmentation  because  its  L2  gradient  flow  results  in  a  linear  differential 
operator.  By  contrast,  the  TV  semi-norm  contains  a  nonlinear  curvature  term. 

The  diffuse  interface  description  has  been  used  successfully  in  image  inpainting  [5,6]  and  image  segmentation  [7].  The 
standard  practice  is  to  introduce  an  additional  fidelity  term  F  to  allow  for  the  specification  of  any  known  information  u: 


E(u )  =  GL(u)  +  F(u,  u). 

One  way  to  minimize  the  energy  is  using  gradient  descent,  producing  the  modified  Allen-Cahn  equation: 

du  SGL  SF  1  ,  SF 

—  = - /x —  =  cAu - 0  (u)  —  ii  — . 


(3) 

(4) 


\AA.  Graph  framework  for  large  datasets 

In  [  1  ],  Bertozzi  and  Flenner  outline  an  approach  for  binary  segmentation  using  the  Ginzburg-Landau  functional  in  a  graph 
domain  instead  of  the  continuous  domain  of  the  original  functional. 

We  consider  a  graph  setting  where  G  =  (V,  E)  is  an  undirected  graph  with  vertices  V  and  edges  E.  For  each  dataset,  the 
vertices  represent  its  building  blocks;  for  example,  in  the  case  of  an  image,  the  vertices  would  consist  of  pixels.  Each  edge 
contains  a  value  w(i,j)  assigned  to  it,  measuring  the  similarity  between  the  two  vertices  i  and  j  it  is  connecting.  In  this  work, 
we  use  the  Gaussian  function  w(i,j )  =  exp(—d(i,j)2/cr2)  as  a  similarity  measure.  Here  d(i,j )  represents  some  measure  of 
“distance”  between  vertices  i  and  j. 

If  one  defines  W  as  the  matrix  Wy  =  w(i,j ),  the  degree  of  a  vertex  i  e  V  as  d*  =  J2jev  w(z,  j),  and  D  to  be  the  diagonal 
matrix  with  elements  d*,  then  the  graph  Laplacian  is  formulated  as  I  =  D  —  W. 

\A.2.  Ginzburg-Landau  functional  on  graphs 

We  use  the  theory  of  nonlocal  calculus,  described  in  [8  ],  to  generalize  the  continuous  GL  formulation  to  a  graphical  setting. 
The  theory  relates  the  spacial  Laplace  operator  to  the  graph  Laplacian  matrix  of  the  previous  section.  In  fact,  the  eigenvectors 
of  the  discrete  Laplacian  converge  to  those  of  the  Laplacian  [1].  However,  in  the  limit  of  large  sample  size,  the  matrix  L  must 
be  scaled  correctly  to  guarantee  stability  of  convergence  to  the  continuum  differential  operator  [1].  A  scaling  we  use  is  the 
normalized  Laplacian 

Ls  =  D~hu =  I  -  D~iwD~i  (5) 

because  the  matrix  is  symmetric,  which  makes  the  linear  algebra  routines  we  use  more  efficient. 

The  GL  functional  on  graphs  is  formulated  as 

GL(u )  =  ^(u,Lsu)  +  0J^(uf-  1)\  (6) 

ieV 

where  uiy  the  zth  component  of  vector  iz,  is  the  state  of  node  z.  Here,  the  gradient  term  in  the  original  functional  ( 1 )  is  replaced 
by  a  more  general  operator  on  graphs.  The  second  term  is  a  double-well  potential  function  having  minima  at  ±1.  This 
representation  is  appropriate  for  binary  classifications  only. 

Note  that  one  benefit  of  using  a  graphical  framework  is  having  a  way  to  deal  with  the  case  of  nonlinearly  separable 
classes.  In  addition,  using  graphs  provides  a  straightforward  way  of  processing  high  dimensional  data. 

1. 1.3.  Semi-supervised  learning  (SSL)  on  graphs 

In  semi-supervised  learning  (SSL),  one  uses  the  knowledge  of  the  labels  of  a  fraction  of  the  nodes.  As  in  (3),  an  additional 
fidelity  term  is  introduced  to  specify  the  information: 

e(u)  =  |  (u,  M)  +  f  (u?  -  0  +  Y)  y  ( Ui  ~  fii)2  >  (y) 

ieV  ieV 

where  zr,,  the  zth  component  of  vector  u,  is  the  (real-valued)  state  of  node  z,  /x;-  is  a  parameter  that  is  equal  to  a  positive 
constant  /z  if  z  is  a  fidelity  node  and  to  zero  otherwise,  and  ut  is  the  known  value  of  fidelity  node  z. 

Thus,  given  an  initial  state  u\  of  each  vertex  z,  the  goal  is  to  minimize  the  GL  functional  with  fidelity  term.  The  classes  are 
obtained  by  thresholding  Up  We  extend  this  efficient  binary  data  segmentation  method  to  multiclass  segmentation  in  the 
following  sections. 
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2.  Multiclass  Ginzburg-Landau  approach 


2.1.  Extension  to  multiclass  segmentation 


Let  Nd  be  the  dimension  of  the  dataset  and  I<  the  number  of  classes.  To  extend  to  the  multiclass  case,  we  now  assign  to 
node  i  a  composition  of  states  U;  €  RKy  where  the  kth  component  of  u;  is  the  strength  that  the  node  belongs  to  class  k.  Note 
that  in  the  binary  case,  each  node  was  assigned  a  single  real  value  denoting  in  some  way  the  class  membership.  Also,  note 
that  u  is  now  a  ND  x  I<  matrix. 

We  require  the  vector  u to  be  an  element  of  the  Gibbs  simplex  HK,  represented  as: 


Z  :=  \(xu...,xK)  €  [0,  If 


E**  =  , 


k=  1 


(8) 


Vertices  of  the  simplex  are  elements  in  which  all  components  vanish,  except  one  which  is  equal  to  1.  These  elements  rep¬ 
resent  points  that  belong  exclusively  to  a  certain  class.  Note  that  the  simplex  formulation  has  a  straightforward  probabilistic 
interpretation,  since  each  row  of  u  is  the  probability  distribution  over  the  I<  classes. 

The  multiclass  GL  energy  functional  on  graphs  is  expressed  as: 


E(u)  = 


-  II 2 

Hi  —  Ui  ||  , 


where  u;  is  a  vector  indicating  prior  class  knowledge  of  node  i,  and  (u,  Lsu)  =  trace  (urLsu). 

The  advantage  of  the  simplex  representation  is  that,  unlike  the  case  with  some  multiclass  methods,  the  penalty  assigned 
to  neighbors  that  are  differently  labeled  is  independent  of  the  labels.  Note  that  in  the  second  term  of  the  energy,  we  use  an 
L\  norm  as  it  prevents  an  undesirable  minimum  from  occurring  at  the  center  of  the  simplex,  as  would  be  the  case  with  an 
L2  norm  for  large  I<.  This  potential  aims  to  provide  a  clear  way  to  calculate  class  memberships,  as  the  phase  composition  is 
purer  near  the  vertices  of  the  simplex.  The  third  (fidelity)  term  allows  the  user  to  input  any  a  priori  information. 


2.2.  Energy  minimization 


Similar  to  the  procedure  in  [1],  we  use  a  convex  splitting  scheme  for  the  minimization  of  the  energy: 

E( u)  =  ^convex  (u)  T"  ^concave  (u) 

€  C 

£convex(u)  =  —  (U,  Lsll)  +  —  (U,  U) 

^concave (u)  =  — —  ^  '  j  [  “  II Uf  —  T"  ^  '  ~  |Uj  —  Uf  —  “  (U,  ^ * 

Z€  ieV  /<=  1  4  ieV  Z  Z 

Here  C  €  M  is  a  constant  big  enough  so  the  concavity/convexity  of  the  above  terms  is  valid.  Under  the  right  conditions,  this 
approach  results  in  an  unconditionally  stable  scheme  for  gradient  descent  [7,9,10]  of  the  form 

n+l  ,  $ ^convex  /  n+l\  n  & ^concave  /  n, 

u  +  +  dt - (u  ^  )  =  u  -  dt - (u  ).  (9) 

S  u  S  u 

We  use  an  implicit  scheme  because  of  the  stiffness  of  the  differential  equations.  Otherwise,  dt  might  be  forced  to  be 
extremely  small  for  any  meaningful  solution.  We  solve  the  scheme  very  efficiently  using  spectral  methods  for  which  only  a 
small  number  of  eigenfunctions  are  needed  to  obtain  a  good  result.  Note  that  after  the  update,  the  solution  might  no  longer 
be  an  element  of  the  Gibbs  simplex,  so  we  use  the  procedure  in  [11]  to  project  the  phase  field  back  to  the  simplex.  For  ini¬ 
tialization,  we  use  a  random  point  on  the  simplex  for  a  non-fidelity  node,  and  the  corresponding  vertex  on  the  simplex  for 
a  fidelity  node. 

The  stopping  criterion  we  use  for  energy  minimization  is 


max  ||u”+1  —  up  ||2 

i 

max  ||u"+1 1|2 

i 


(10) 


Finally,  we  assign  node  i  to  class  k  if  u,  is  closest  to  vertex  e/<  on  the  simplex. 

Note  that  other  operator  splitting  methods  have  been  studied  for  minimization  problems  (e.g.  [12]).  Ours  however  has 
the  advantages:  (i)  it  is  direct  and  thus  does  not  require  one  to  solve  further  minimization  problems,  (ii)  one  can  adjust  the 
accuracy  by  changing  the  number  of  eigenfunctions,  and  (iii)  its  complexity  is  close  to  linear  in  |  V\.  Details  about  (iii)  can  be 
found  in  [3]. 
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3.  MBO  reduction  of  the  Ginzburg-Landau  energy  functional 

In  [13],  Merriman,  Bence  and  Osher  describe  a  simple  algorithm  to  approximate  motion  by  mean  curvature,  or  motion 
in  which  normal  velocity  equals  mean  curvature.  Their  algorithm  consists  of  alternating  between  the  following  steps: 

1.  Diffusion.  Let  un+^  =  S(St)un  where  S(St)  is  the  propagator  (by  time  St)  of  the  standard  heat  equation:  =  Au. 

2.  Thresholding.  Let 

„+i_|l  ifu"+2>0, 

[-1  if un+2  <  0. 

Our  interest  in  this  algorithm  comes  from  its  relation  to  the  basic  (unmodified)  Allen-Cahn  equation: 

3  u  1  . 

=  eAu--<P\u).  (11) 

3 1  € 

One  can  use  a  simple  time-splitting  scheme,  in  which  one  first  propagates  using  the  first  term  of  the  right  side  of  (11)  and 
then  the  second  one,  to  evolve  the  Allen-Cahn  equation.  Note,  however,  that  in  the  e  — ►  0  limit,  the  second  step  is  simply 
thresholding  [13].  Thus,  as  e  ->  0,  this  time  splitting  scheme  amounts  to  alternating  between  diffusion  and  thresholding 
steps,  exactly  the  MBO  scheme.  Moreover,  [14]  shows  that,  as  e  ->  0,  the  rescaled  solutions  ue(z,  t/e)  of  (11)  yield  motion 
by  mean  curvature  of  the  interface  between  the  two  phases  of  the  solutions. 

Barles  [15]  and  Evans  [16]  have  proven  rigorously  that  the  MBO  scheme  approximates  motion  by  mean  curvature. 

3.1.  Graph  formulation 

The  motion  by  mean  curvature  of  the  MBO  scheme  was  generalized  to  graphs  by  Merkurjev  et  al.  in  [2],  where  a  modified 
MBO  scheme  on  graphs  has  been  used  to  formulate  an  algorithm  for  binary  classification  and  image  processing.  The  authors 
apply  a  two-step  time  splitting  scheme  to  (4)  so  that  the  second  step  is  the  same  as  the  one  in  the  original  MBO  scheme,  and 
then  replace  the  Au  term  with  a  more  general  graph  term  — Lsu.  The  result  is  a  scheme  that  alternates  between  propagation 
using  the  modified  Allen-Cahn  Eq.  (4)  on  graphs  and  thresholding. 

3.2.  Extension  to  multiclass  segmentation 

The  standard  Gibbs-simplex  UK  defined  in  Eq.  (8)  allows  the  multiclass  extension  of  the  algorithm  in  [2]  to  be  straight¬ 
forward.  Step  2  of  the  method  above  is  modified  so  that  the  thresholding  now  takes  the  form  of  a  displacement  towards  the 
vertex  in  the  Gibbs  simplex  closest  to  the  projection  of  the  result  of  the  diffusion  step  using  the  model  in  [  1 1  ].  In  summary, 
using  the  same  notation  as  in  Section  2,  the  new  algorithm  consists  of  alternating  between: 

1.  Heat  equation  with  forcing  term: 

XXn+\  —  un  ,l 

- - - =  —  Lsun+2  -  fi( Un  -  u).  (12) 

dt 

2.  Thresholding: 

<+1=ek,  (13) 

n+i 

where  vertex  ek  is  the  vertex  in  the  simplex  closest  to  the  projection  of  u-  2  using  [11]. 

The  scheme  is  solved  using  spectral  methods.  The  initialization  procedure  and  the  stopping  criterion  are  the  same  as  in 
Section  2.  In  practice,  the  diffusion  step  can  be  repeated  several  times,  denoted  by  Afe,  before  the  thresholding  step.  Finally, 
node  \  is  assigned  to  class  k  if  the  fcth  component  of  u;  is  one. 

In  practice,  the  complexity  of  this  algorithm  is  close  to  linear  as  well  [3]. 

4.  Numerical  results 

4.1.  Previous  results  for  MNIST,  WebKB,  COIL  and  three  moons  datasets 

We  first  summarize  the  results  in  [3],  which  contains  a  more  detailed  description  of  our  algorithms.  The  main  datasets 
used  in  the  paper  were  the  WebKB  [17],  MNIST  [18]  and  COIL  [19]  benchmark  datasets.  For  WebKB,  multiclass  MBO  and 
GL  achieved  an  accuracy  of  88.48%  and  87.2%,  respectively.  This  was  compared  with  supervised  learning  methods  reported 
in  [20],  such  as  centroid-normalized  sum  (82.66%),  naive  Bayes  (83.52%)  and  SVM  (linear  kernel)  (85.82%).  For  these  meth¬ 
ods,  two  thirds  of  the  data  was  used  for  training,  and  one  third  for  testing.  Our  methods  obtain  higher  accuracy  using  only 
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(a)  Swiss  roll  dataset.  (b)  Spectral  clustering  result  (49%).  (c)  Our  method’s  result  (94.5%). 

Fig.  1.  Swiss  roll  dataset  results. 

20%— 25%  fidelity.  For  the  MNIST  dataset,  the  accuracy  of  the  two  methods  was  96.91%  and  96.8%,  respectively,  which  was 
comparable  to  the  results  obtained  by  neutral/convolutional  nets,  boosted  stumps  and  nonlinear  classifiers,  and  higher  than 
that  of  methods  such  as  transductive  classification  and  Cheeger  cuts.  For  the  COIL  dataset,  multiclass  MBO  and  GL  were  able 
to  achieve  higher  accuracy  (91 .46%  and  91.2%,  respectively)  when  compared  to  some  of  the  best  methods,  like  MP,  SQ-Loss-I 
and  sGT,  mentioned  in  [21].  The  paper  also  presents  a  co-segmentation  example  using  an  image,  as  well  as  results  for  the 
synthetic  “three  moons”  dataset.  For  the  latter,  the  results  were  99.12%  and  98.1%  for  multiclass  MBO  and  GL  methods, 
respectively. 


4.2.  Swiss  roll 


The  Swiss  roll  dataset,  pictured  in  Fig.  la,  contains  1600  3D  points  arranged  in  spirals.  To  calculate  the  weight  matrix,  we 
used  the  weight  function  in  [22]  with  10  nearest  neighbors.  To  calculate  the  eigenvectors,  we  used  the  procedure  in  [23]. 
The  parameters  used  were:  50  eigenvectors,  C  =  51, 6  =  1,  At  =  0.1,  /x  =  50  and  r\  —  10-7  with  80  fidelity  points  (5%). 

Averaged  over  100  runs,  the  average  accuracies  obtained  were  94.1%  and  91.8%  for  the  multiclass  GL  and  MBO  methods, 
respectively.  The  visual  result  of  the  latter  method  is  included  in  Fig.  lc.  We  compare  this  result  to  the  one  obtained  using 
spectral  clustering  (Fig.  lb):  average  accuracy  was  only  49.75%  over  100  runs,  with  the  graph  being  the  same  as  for  multiclass 
GL  and  MBO  cases.  Calculations  were  done  with  4  eigenvectors. 


4.3.  Landsat  satellite  dataset 

This  database  contains  multi-spectral  values  of  pixels  in  neighborhoods  in  a  satellite  image.  To  calculate  the  weight 
matrix,  we  used  the  weight  function  in  [22]  with  30  nearest  neighbors.  To  calculate  the  eigenvectors,  we  used  the  procedure 
in  [23].  The  parameters  used  were:  200  eigenvectors,  C  =  51,  e  =  1,  dt  =  0.1,  /x  =  50  and  rj  =  10-7  with  350  fidelity 
points  (5.44%). 

Averaged  over  30  runs,  the  average  accuracies  obtained  were  87.05%  and  87.25%  for  the  multiclass  GL  and  MBO  methods, 
respectively.  We  compare  these  results  to  several  supervised  learning  methods  listed  in  [24].  These  algorithms  were 
performed  using  80%  training  and  20%  validation.  The  results  were:  65.15%  using  SC-SVM,  75.43%  using  SH-SVM,  65.88% 
using  S-LS,  86.65%  using  simplex  boosting,  and  90.15%  using  S-LS  rbf.  We  outperform  all  but  the  last  algorithm  using  only 
5%  fidelity,  while  these  algorithms  use  80%  for  training. 


4.4.  Human  activity  dataset 


This  dataset  from  the  UCI  Machine  Learning  Repository  contains  information  about  experiments  carried  out  with  30 
volunteers.  Each  person  performed  one  of  six  actions:  walking,  walking  upstairs,  walking  downstairs,  sitting,  standing  and 
laying  while  wearing  a  smartphone  on  the  waist.  The  smartphone  recorded  their  linear  acceleration  and  3-axial  angular 
velocity.  The  goal  is  to  segment  the  people  into  six  classes  according  to  activity  using  the  information  obtained  from  the 
phone.  We  used  the  weight  function  in  [22]  with  59  nearest  neighbors.  The  parameters  used  were:  50  eigenvectors,  C  =  160, 
dt  =  0.31,  /x  =  159  and  77  =  10-7  with  5%  of  points  being  fidelity  points. 

The  average  accuracy  was  89.7%  for  the  multiclass  MBO  method,  while  being  88.7%  for  the  GL  method.  We  compare  this 
to  the  results  of  [25],  where  the  methods  MC-SVM  and  MC-HF-SVM  have  accuracies  of  89.3%  and  89.0%,  respectively.  It  is 
important  to  note  that  the  results  of  the  paper  were  obtained  using  supervised  learning  methods  where  70%  of  the  data  was 
used  for  training  and  the  rest  for  testing.  We  obtain  higher  accuracy  (with  multiclass  MBO)  using  only  5%  fidelity. 
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